Introduction

Spatial frequency (SF) constitutes a pivotal component of visual stimulus encoding in the primate visual system and is defined as the number of grating cycles per unit of visual angle. Higher SF (HSF) corresponds to intricate details, while lower SF (LSF) captures broader information. Previous psychophysical studies have demonstrated the profound influence of SF manipulation on object recognition and categorization processes (Joubert et al., 2007; Schyns and Oliva, 1994; Craddock et al., 2013; Caplette et al., 2014; Cheung and Bar, 2014; Ashtiani et al., 2017). Saneyoshi and Michimata (2015) and Jahfari (2013) have highlighted the significance of HSF and LSF for categorical and coordinate processing in object recognition, respectively. The sequence in which SF content is presented also affects categorization performance, with coarse-to-fine presentation leading to faster categorization (Kauffmann et al., 2015). Treating the face as a special object, several studies have shown that middle and higher SFs are more critical for face recognition (Costen et al., 1996; Hayes et al., 1986; Fiorentini et al., 1983; Cheung et al., 2008). Another influential theory suggested by psychophysical studies is the coarse-to-fine perception of visual stimuli, which states that LSF or global contents are processed faster than HSF or local contents (Schyns and Oliva, 1994; Rotshtein et al., 2010; Gao, 2011; Yardley et al., 2012; Kauffmann et al., 2015; Rokszin, 2016). Despite the extensive reliance on psychophysical studies to examine the influence of SF on categorization tasks, our understanding of SF representation within the primate visual system, particularly in higher visual areas like the inferior temporal (IT) cortex, remains limited because few studies have addressed it.

One of the seminal studies investigating the neural correlates of SF processing and its significance in object recognition was conducted by Bar (2003). This work proposes a top-down mechanism driven by the rapid processing of LSF content, facilitating object recognition (Bar, 2003; Fenske et al., 2006). The exploration of SF representation has revealed the engagement of distinct brain regions in processing various SF contents (Fintzi and Mahon, 2014; Chaumon et al., 2014; Bermudez et al., 2009; Iidaka et al., 2004; Peyrin et al., 2010; Gaska et al., 1988; Bastin et al., 2013; Oram and Perrett, 1994). More specifically, the orbitofrontal cortex (OFC) has been identified as accessing global (LSF) and local (identity; HSF) information in the right and left hemispheres, respectively (Fintzi and Mahon, 2014). The V3A area exhibits low-pass tuning curves (Gaska et al., 1988), while HSF processing activates the left fusiform gyrus (Iidaka et al., 2004). Neural responses in the IT cortex, which plays a pivotal role in object recognition and face perception, correlate with the SF components of complex stimuli (Bermudez et al., 2009). Despite the acknowledged importance of SF as a critical characteristic influencing object recognition, its representation is still not well understood. By unraveling the neural mechanisms underlying SF representation in the IT cortex, we can enrich our comprehension of the processing and categorization of visual information.

To address this issue, we investigate the SF representation in the IT cortex of two passive-viewing macaque monkeys. We studied the neural responses of the IT cortex to intact, SF-filtered (five ranges), and phase-scrambled stimuli. SF decoding is observed in both population- and single-level representations. Investigating the decoding pattern of individual SF bands reveals a coarse-to-fine pattern in recall performance, where LSF is decoded more accurately than HSF. Temporal dynamics analysis shows that SF coding exhibits a coarse-to-fine pattern, emphasizing faster processing of lower frequencies. Moreover, SF representation forms an average LSF-preferred tuning across neuron responses at 70ms to 170ms after stimulus onset. Then, the average preferred SF shifts monotonically to HSF in time after the early phase of the response, with its peak at 220ms after the stimulus onset. The LSF-preferred tuning turns into an HSF-preferred one in the late neuron response phase.

Next, we examined the relationship between SF and category coding. We found a strong positive correlation between SF and category coding performances in sub-populations of neurons. SF coding capability of individual neurons is highly correlated with the category coding capacity of the sub-population. Moreover, clustering neurons based on their SF responses indicates a relationship between SF representation and category coding. Employing the neuron responses to five SF ranges considering only the scrambled stimuli, an SF profile was identified for each neuron that predicts the categorization performance of that neuron in a population of the neurons sharing the same profile. Neurons whose response increases with increasing SF encode faces better than other neuron populations with other profiles.

Given the co-existence of SF and category coding within the IT cortex and the capability of SF to predict category selectivity, we examined the neural mechanisms underlying SF and category representation. At the single-neuron level, we found no correlation between the SF and category coding capabilities of single neurons. At the population level, we found that the contribution of neurons to SF coding did not correlate with their contribution to category coding. Delving into the characteristics of SF coding, we found that individual neurons carry more independent SF-related information compared to the encoding of categories (face vs. non-face). Analyzing the temporal dynamics of each neuron’s contribution to population-level SF coding reveals a shift in sparsity during different phases of the response. In the early phase (70ms-170ms), the contribution is more sparse than category coding. However, this behavior is reversed in the late phase (170ms-270ms), with SF coding showing a less sparse contribution.

Finally, we compared the representation of SF in the IT cortex with several popular convolutional neural networks (CNNs). We found that CNNs exhibited robust SF coding capabilities with significantly higher accuracies than the IT cortex. Like the IT cortex, LSF content showed higher decoding performance than the HSF content. However, while there were similarities in SF representation, CNNs did not replicate the SF-based profiles predicting neuron category selectivity observed in the IT cortex. We posit that our findings establish neural correlates pertinent to behavioral investigations into SF’s role in object recognition. Additionally, our results shed light on how the IT cortex represents and utilizes SF during the object recognition process.

Results

SF coding in the IT cortex

To study the SF representation in the IT cortex, we designed a passive stimulus presentation task (Figure1a, see Materials and methods). The task comprises two phases: the selectivity phase and the main phase. During the selectivity phase, 155 stimuli, organized into two super-ordinate and four ordinate categories, were presented (with a 50ms stimulus presentation followed by a 500ms blank period, see Materials and methods). Next, the six most responsive stimuli are selected along with nine fixed stimuli (six faces and three non-face objects, Figure1b) to be presented during the main phase (35ms stimulus presentation followed by a 465ms blank, see Materials and methods). Each stimulus is phase scrambled, and then the intact and scrambled versions are filtered in five SF ranges (R1 to R5, with R5 representing the highest frequency band, Figure1b), resulting in a total of 180 unique stimuli presented in each session (see Materials and methods). Each session consists of 15 blocks, with each stimulus presented once per block in a random order. IT neurons were recorded from the passively viewing monkeys, with the recorded cells covering the IT area uniformly (Figure1a). We only considered the responsive neurons (see Materials and methods), totaling 266 (157 M1 and 109 M2). The peristimulus time histogram (PSTH) of a sample neuron (neuron #155, M1) in response to the scrambled stimuli for R1, R3, and R5 is illustrated in Figure1c. R1 exhibits the most pronounced firing rate, indicating the highest neural activity level. In contrast, R5 displays the lowest firing rate, suggesting an LSF-preferred trend in the neuron’s response. To explore the SF representation and coding capability of IT neurons, each stimulus in each session block is represented by an N element vector where the i’th element is the average response of the i’th neuron to that stimulus within a 50ms time window (see Materials and methods).
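
The response-vector construction described above can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `response_vector`, the spike-time input format (milliseconds relative to stimulus onset), and the centering of the 50ms window on a chosen time point are assumptions made for the example.

```python
def response_vector(spike_times_per_neuron, center_ms, window_ms=50.0):
    """Return one firing-rate entry (spikes/s) per neuron for a single trial.

    spike_times_per_neuron: list with one list of spike times (ms, relative
    to stimulus onset) per neuron. This input format is hypothetical.
    """
    start = center_ms - window_ms / 2.0
    stop = center_ms + window_ms / 2.0
    rates = []
    for spikes in spike_times_per_neuron:
        # Count spikes inside the half-open window [start, stop).
        count = sum(1 for t in spikes if start <= t < stop)
        # Convert the count to spikes per second.
        rates.append(count * 1000.0 / window_ms)
    return rates


# Toy trial with three neurons; the window is centered at 120ms.
trial = [[80.0, 101.5, 118.0, 140.0], [30.0, 250.0], []]
vec = response_vector(trial, center_ms=120.0)
print(vec)
```

Sliding `center_ms` across the trial yields the time-resolved representation; averaging such vectors over repetitions of a stimulus would give a stimulus-level vector.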

Experimental design and SF coding.

a Experimental design. The design of the experiment involved the collection of responses from IT neurons to 15 stimuli (including six faces, three non-faces, and six selective stimuli, see Materials and methods) in six SF bands (intact and R1 to R5, see Materials and methods), and two versions (scrambled and unscrambled) using a passive presentation task. Presentation of blocks starts if the monkey preserves fixation for 200ms. Each block consisted of a 35ms stimulus presentation followed by a blank screen with a fixation point of 465ms, and each stimulus was presented 15 times. The recorded signals were sorted, and visually responsive neurons were selected (N = 266, see Materials and methods). b A sample of the fixed stimulus set. This panel shows three (out of six) faces, three non-faces, and one scrambled sample stimulus. Each row corresponds to an SF range starting with intact, followed by R1 to R5 (low to high SF). c A sample neuron. The PSTH of a sample neuron (N = 151, M1) for scrambled stimuli is depicted. To generate a response vector for a given stimulus or trial, the responses of each neuron were averaged in a 50ms time window centered around the relevant time point. The PSTH was smoothed using a Gaussian kernel with a standard deviation of 20ms. The responses of three SF bands (R1, R3, and R5) are shown for better illustration. d SF coding exists in the IT cortex. The decoding performance of SF ranges using scrambled stimuli is shown over time. Single-level and population-level representations were fed into an LDA algorithm to predict the SF range of the scrambled stimuli. Shadows illustrate the SEM and STD for single and population levels, respectively. This figure highlights the presence of SF coding in both individual and population neural activity. e LSF-preferred nature of SF coding. The population recall of each SF band in response to scrambled stimuli, determined using the LDA method, is presented. The error bars indicate the STD. 
The results demonstrate a decreasing trend as SF moves towards higher frequencies, suggesting a coarse-to-fine decoding preference.

To assess whether individual neurons encode SF-related information, we utilized the linear discriminant analysis (LDA) method to predict the SF range of the scrambled stimuli based on neuron responses (see Materials and methods). Figure1d displays the average time course of SF discrimination accuracy across neurons. At single-level, the accuracy surpasses the chance level by an average of 4.02% at 120 ms after stimulus onset. We only considered neurons demonstrating at least three consecutive time windows with accuracy significantly greater than the chance level, resulting in a subset of 105 neurons. The maximum accuracy of a single neuron was 19.08% higher than the chance level (neuron #193, M2). Subsequently, the SF decoding performance of the IT population is investigated (R1 to R5 and scrambled stimuli only, see Materials and methods). Figure1d also illustrates the SF classification accuracy across time in population-level representations. The peak accuracy is 24.68% higher than the chance level at 115ms after the stimulus onset. These observations indicate the explicit presence of SF coding in the IT cortex. To determine the discrimination of each SF range, Figure1e shows the recall of each SF content for the time window of 70ms to 170ms after stimulus onset. This observation reveals an LSF-preferred decoding behavior across the IT population (normalized recall, R1=0.47±0.04, R2=0.36±0.03, R3=0.30±0.03, R4=0.32±0.04, R5=0.30±0.03, and R1 > R5, p-value<0.001).
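
As a small worked illustration of the recall measure reported above, the sketch below computes plain per-class recall from true and predicted SF labels. It assumes the classifier predictions are already available; the function name `per_class_recall` and the toy labels are hypothetical, and any additional normalization applied in the paper is not reproduced. With five SF ranges, the chance level of the decoder is 1/5 = 20%.

```python
def per_class_recall(true_labels, predicted_labels):
    """Recall per class: correctly predicted samples of a class / all samples of that class."""
    recall = {}
    for c in sorted(set(true_labels)):
        idx = [i for i, t in enumerate(true_labels) if t == c]
        hits = sum(1 for i in idx if predicted_labels[i] == c)
        recall[c] = hits / len(idx)
    return recall


# Toy example: four R1 trials and four R5 trials (labels are illustrative only).
true = ["R1"] * 4 + ["R5"] * 4
pred = ["R1", "R1", "R1", "R3", "R5", "R2", "R4", "R5"]
rec = per_class_recall(true, pred)
print(rec)
```

In this toy case three of four R1 trials and two of four R5 trials are recovered, so R1 has the higher recall.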

Temporal dynamics of SF representation

The sample neuron and recall values in Figure1 indicate an LSF-preferred neuron response. To explore this behavior over time, we analyzed the temporal dynamics of SF representation. Figure2a illustrates the onset of SF recalls, revealing a coarse-to-fine trend where R1 is decoded faster than R5 (onset times in milliseconds after stimulus onset, R1=84.5±3.02, R2=86.0±4.4, R3=88.9±4.9, R4=86.5±4.1, R5=97.15±4.9, R1 < R2, p-value<0.001). Figure2b illustrates the time course of the average preferred SF across the neurons. To calculate the preferred SF for each neuron, we multiplied the firing rate by the SF range and normalized the values (see Materials and methods). Figure2b demonstrates that following the early phase of the response (70ms to 170ms), the average preferred SF shifts towards HSF, reaching its peak at 215ms after stimulus onset (preferred SF, 0.54±0.15). Furthermore, a second peak emerges at 320ms after stimulus onset (preferred SF, 0.22±0.16), indicating a shift in the average preferred SF in the IT cortex towards higher frequencies. To analyze this shift, we divided the time course into two intervals: 70ms to 170ms, where the response peak of the neurons occurs, and 170ms to 270ms, where the first peak of SF preference occurs. For each interval, we calculated the percentage of neurons that responded significantly more strongly to one SF range than to the others (one-way ANOVA with a significance level of 0.05, see Materials and methods). Figure2c and d show the percentage of the neurons in each SF range for the two time intervals. In the early phase of the response (T1, 70ms to 170ms), the highest percentage of the neurons belongs to R1, 40.19%, and a decreasing trend is observed as we move towards higher frequencies (R1=40.19%, R2=19.60%, R3=13.72%, R4=10.78%, R5=15.68%). Moving to T2, the percentage of neurons responding to R1 more strongly than to the others remains roughly stable, at 38.46%. 
The number of neurons in R2 also drops to under 5% from 19.60% observed in T1. On the other hand, the percentage of the neurons in R5 reaches 46.66% in T2 compared to 15.68% in T1 (higher than R1 in T1). This observation indicates that the increase in preferred SF is due to a substantial increase in the selective neurons to HSF, while the response of the neurons to R1 is roughly unchanged. To further understand the population response to various SF ranges, the average response across neurons for R1 to R5 is depicted in Figure2c and d (bottom panels). In the first interval, T1, an average LSF-preferred tuning is observed where the average neuron response decreases as the SF increases (normalized firing rate for R1=1.09±0.01, R2=1.05±0.01, R3=1.03±0.01, R4=1.03±0.02, R5=1.00±0.01, Bonferroni corrected p-value for R2<R5, 0.006). During the second time interval, excluding R1, the decreasing pattern transformed to an increasing one, with the response to R5 surpassing that of R1 (normalized firing rate for R1=0.80±0.02, R2=0.73±0.02, R3=0.76±0.02, R4=0.81±0.02, R5=0.84±0.01, Bonferroni corrected p-value for R2<R4, 0.022, R2<R5, 0.0003, and R3<R5, 0.03). These observations illustrate an LSF-preferred tuning in the early phase of the response, shifting towards HSF-preferred tuning in the late response phase.
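
The preferred-SF measure used above (firing rate multiplied by SF range, then normalized) can be sketched as a rate-weighted average of the band index. The exact normalization is given in Materials and methods and is not reproduced here; coding the bands R1 to R5 as 1 to 5 and dividing by the summed rate are assumptions of this example, so the values it returns are not on the same scale as those reported in Figure2b.

```python
def preferred_sf(rates, band_values=(1, 2, 3, 4, 5)):
    """Rate-weighted mean of the SF band index (assumed form, not the paper's exact formula).

    rates: non-negative firing rates of one neuron for R1..R5.
    """
    total = sum(rates)
    if total == 0:
        raise ValueError("all firing rates are zero; preferred SF is undefined")
    return sum(r * b for r, b in zip(rates, band_values)) / total


flat = preferred_sf([1, 1, 1, 1, 1])          # no preference: lands on the middle band
lsf_biased = preferred_sf([5, 4, 3, 2, 1])    # stronger response to low SF
hsf_biased = preferred_sf([1, 2, 3, 4, 5])    # stronger response to high SF
print(flat, lsf_biased, hsf_biased)
```

Under this form, a neuron whose response grows with SF yields a value above the middle band and an LSF-biased neuron yields a value below it, which is the direction of the shift described between T1 and T2.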

The temporal dynamics of SF representation.

a Coarse-to-fine nature of SF coding. The onset time of the recall of each SF range in scrambled stimuli is illustrated, with error bars indicating the STD. The results suggest that the onset time of decoding increases as SF increases. b SF preference shifts toward higher frequencies over time. The time course of the average preferred SF (see Materials and methods) across neurons is illustrated. The average preferred SF of IT neurons moves towards higher frequencies from 170ms after stimulus onset, reaching its highest value at 220ms. A second peak emerges at 320ms following the stimulus onset. The SF preference shows a monotonic increase followed by a decrease in time. c,d Shift in neural response towards HSF. The average response of all neurons within the two time intervals (T1 and T2 in panel b) is shown, with error bars indicating the SEM. c In T1, from 70ms to 170ms after stimulus onset, a decreasing response of the neurons is observed as the SF content shifts towards higher frequencies. The relative percentage of neurons showing stronger responses to SF ranges (R1 to R5) in T1 is depicted in the inner top panel. R1 is the most responsive SF for roughly 40% of the neurons. d In the following interval (T2, 170ms to 270ms), an increasing tuning is observed from R2 to R5, where R5 elicits the highest firing rates. Furthermore, in T2, there is a roughly threefold increase in the percentage of neurons exhibiting stronger responses to R5 compared to T1, indicating a shift in the neurons’ responses towards HSF (top panel).

SF profile predicts category coding

Our findings indicate explicit SF coding in the IT cortex. Given the co-existence of SF and category coding in this region, we examine the relationship between SF and category coding. As depicted in Figure2, while the average preferred SF across the neurons shifts to HSF, the most responsive SF range varies across individual neurons. To investigate the relation between SF representation and category coding, we fitted a quadratic curve to each neuron's responses across SF ranges (R1 to R5, phase-scrambled stimuli only). Then, according to the fitted curve, an SF profile is determined for each neuron (see Materials and methods). Five distinct profiles were identified based on the tuning curves (Figure3a): i) flat, where the neuron has no preferred SF (not included in the results), ii) LSF preferred (LP), where the neuron response decreases as SF increases, iii) HSF preferred (HP), where the neuron response increases as the SF shifts towards higher SFs, iv) U-shaped (U), where the neuron response to middle SF is lower than that of HSF or LSF, and v) inverse U-shaped (IU), where the neuron response to middle SF is higher than that of LSF and HSF. The U-shaped and HSF-preferred profiles represent the largest and smallest populations, respectively. Following profile identification, the object coding capability of each profile population is assessed. Here, instead of LDA, we employ the separability index (SI) introduced by Dehaqani et al. (2016), because LDA only categorizes samples as correctly classified or misclassified and therefore does not fully capture the information differences between groups.
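
The profile assignment described above can be sketched as a quadratic least-squares fit over the five bands followed by a rule on the fitted curve. The actual decision rule is specified in Materials and methods; the one used here (extremum within the middle bands gives U or IU, otherwise the sign of the trend gives LP or HP, and a negligible fitted range gives flat) is an assumption for illustration, as are the function names and the tolerance `flat_tol`.

```python
def fit_quadratic(y):
    """Least-squares fit y ~ c0 + c1*x + c2*(x^2 - 2) on x = -2..2 (bands R1..R5).

    The three basis functions are mutually orthogonal on these five points,
    so each coefficient is a simple projection.
    """
    x = [-2, -1, 0, 1, 2]
    q = [xi * xi - 2 for xi in x]  # [2, -1, -2, -1, 2]
    c0 = sum(y) / 5.0
    c1 = sum(xi * yi for xi, yi in zip(x, y)) / 10.0
    c2 = sum(qi * yi for qi, yi in zip(q, y)) / 14.0
    return c0, c1, c2


def sf_profile(y, flat_tol=1e-6):
    """Assign a profile label from the fitted curve (hypothetical rule)."""
    c0, c1, c2 = fit_quadratic(y)
    fitted = [c0 + c1 * x + c2 * (x * x - 2) for x in (-2, -1, 0, 1, 2)]
    if max(fitted) - min(fitted) < flat_tol:
        return "flat"
    if c2 != 0:
        vertex = -c1 / (2.0 * c2)  # location of the fitted extremum
        if -1.0 <= vertex <= 1.0:  # extremum falls within the middle bands
            return "U" if c2 > 0 else "IU"
    return "HP" if fitted[-1] > fitted[0] else "LP"


labels = [sf_profile(r) for r in (
    [5, 4, 3, 2, 1], [1, 2, 3, 4, 5], [3, 1, 0, 1, 3], [0, 2, 3, 2, 0], [2, 2, 2, 2, 2])]
print(labels)
```

In practice the flat class would more plausibly be defined by a statistical test on the responses rather than a fixed tolerance; the tolerance is used here only to keep the sketch self-contained.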

SF profile predicts category coding.

a,b SF profile predicts category selectivity. a The responses of each neuron were standardized by subtracting the mean and dividing by the standard deviation of the baseline time. Neurons were then categorized into four groups based on the fitting of a quadratic function to their responses (see Materials and methods). Each panel presents the average neuron responses within each category for SF ranges R1 to R5, with error bars indicating the SEM of the response values. The percentage of the neurons in each category is displayed at the top of each panel. The “flat” category, where the response to no SF was higher than others, was excluded from this analysis. b SI of face/non-face vs. scrambled stimuli is illustrated (see Materials and methods). The SI value and SF profile are determined within the time window of 70ms to 170ms after stimulus onset. The HSF-preferred population exhibited significantly higher face SI compared to the other groups. The LSF-preferred population displayed a significant difference in face and non-face SI. On the other hand, the IU profile indicates a significantly higher SI value for the non-face compared to the face. The U-shaped profile did not show any significant differences between the face and the non-face. These results suggest that the neuron’s response to various SF bands can predict its decoding capability. c,d The relation between SF and category coding in sub-populations. Initially, the LDA method was employed to calculate each individual neuron’s single-level performance in category and SF coding. Next, a sorting procedure based on SF (panel c) and category (panel d) coding performances was conducted to create sub-populations of neurons exhibiting similar capabilities (see Materials and methods). The scatter plot of the category and SF coding accuracy of these sub-populations demonstrated a notably high degree of positive correlation between SF and category accuracies in the IT cortex.

To examine the face and non-face information separately, SI is calculated for face vs. scrambled and non-face vs. scrambled. Figure3a displays the identified profiles and Figure3b indicates the average SI value during 70ms to 170ms after the stimulus onset. The HSF-preferred profile shows significantly higher face information than the other profiles (face SI for LP=0.58±0.03, HP=0.89±0.05, U=0.07±0.01, IU=0.07±0.01, HP > LP, U, IU with p-value < 0.001), and its face information also exceeds the non-face information in all profiles (non-face SI for LP=0.04±0.01, HP=0.02±0.01, U=0.19±0.03, IU=0.08±0.02, and face SI in HP is greater than non-face SI in all groups with p-value < 0.001). This observation underscores the importance of middle and higher frequencies for face representation. The LSF-preferred profile also exhibits significantly higher SI for faces than for non-face objects (p-value<0.001). On the other hand, in the IU profile, non-face SI surpasses face SI (p-value<0.001), indicating the importance of middle frequencies for the non-face objects. Finally, in the U profile, there is no significant difference between the face and non-face objects (face vs. non-face p-value=0.36).

To examine the relation between SF and the category coding capacity of the neurons, the performance of individual neurons is considered (Figure3c and d). Neurons were sorted based on their SF or category performances, resulting in two separate sets of ranks: one for SF and another for category. Subsequently, we selected sub-populations of neurons with similar ranks according to SF or category (see Materials and methods). Each sub-population comprises 20 neurons with approximately similar SF (or category) performance levels. Then, the SF and category decoding accuracy is calculated for each sub-population. The scatterplot of individual vs. sub-population accuracy demonstrated a significant positive correlation between the sub-population performance and the accuracy of individual neurons within those populations. Specifically, the correlation value for SF-sorted and category-sorted groups is 0.66 (p-value=10^-31) and 0.39 (p-value=10^-10), respectively. This observation illustrates that SF coding capacity at single-level representations significantly predicts category coding capacity at the population level.
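
The rank-based grouping and the correlation reported above can be sketched as follows. Whether the sub-populations overlap is specified in Materials and methods; the sliding window with a step of one rank used here is an assumption, as are the function names `rank_windows` and `pearson` and the synthetic scores.

```python
import math


def rank_windows(scores, size=20, step=1):
    """Group neuron indices into windows of consecutive ranks (ascending score)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return [order[s:s + size] for s in range(0, len(order) - size + 1, step)]


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


# Synthetic single-neuron accuracies for 266 neurons (a fixed permutation, not real data).
scores = [((i * 37) % 266) / 266.0 for i in range(266)]
windows = rank_windows(scores)
# Mean single-neuron score of each sub-population; a population decoder would
# supply the second variable of the correlation in the actual analysis.
mean_single = [sum(scores[i] for i in w) / len(w) for w in windows]
print(len(windows), pearson([1, 2, 3], [2, 4, 6]))
```

With 266 neurons, a window of 20, and a step of one rank, this produces 247 sub-populations; each would then be decoded for SF and category, and the two sets of accuracies correlated.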

Uncorrelated mechanisms for SF and category coding

As both SF and category coding exist in the IT cortex at both the single neuron and population levels, we investigated their underlying coding mechanisms. Figure4a displays the scatter plot of SF-category coding capacity for individual neurons. The correlation between SF and category accuracy across individual neurons shows no significant relationship (correlation: 0.024 and p-value: 0.53), suggesting two uncorrelated mechanisms for SF and category coding. To explore the population-level coding, we considered neuron weights in the LDA classifier as indicators of each neuron’s contribution to population coding. Figure4b indicates the scatter plot of the neuron’s weights in SF and category decoding. The LDA weights reveal no correlation between the patterns of neuron contribution in population decoding of SF and category (correlation=0.002 and p-value=0.39). These observations indicate uncorrelated coding mechanisms for SF and category in both single and population-level representations in the IT cortex.

Uncorrelated mechanisms for SF and category coding.

a Uncorrelated SF and category coding at the single level. The scatter plot indicates the category-SF accuracies and does not reveal a significant correlation between SF and category coding capabilities within the IT cortex at the single-neuron level. The error bars show the STD for SF and category decoding accuracies. b Uncorrelated neuron contribution in SF and category coding in population. The LDA weight of each neuron is considered as the neuron contribution in the population coding of SF or category (see Materials and methods). The scatter plot of the neuron weights in SF shows a near-zero correlation with the neuron weights in category coding.

Next, to investigate SF and category coding characteristics, we systematically removed individual neurons from the population and measured the resulting drop in LDA classifier accuracy as a metric for the neuron’s impact, termed single neuron contribution (SNC). Figure5a illustrates the SNC score for SF and category (face vs. non-face) decoding within 70ms to 170ms after the stimulus onset. The SNC in SF is significantly higher than for category (average SNC for SF=0.0051±0.0002 and category=0.001±0.0004, SF > category with p-value=1.6 × 10^-13). Therefore, SF representation relies more on individual neuron representations, suggesting a sparse mechanism of SF coding where single-level neuron information is less redundant. In contrast, single-level representations of category appear to be more redundant and robust against information loss or noise at the level of individual neurons. We utilized conditional mutual information (CMI) between pairs of neurons conditioned on the label (SF or category) to assess the information redundancy across the neurons. CMI quantifies the information shared between the two neurons of a pair regarding SF or category coding. Figure5b indicates a significantly lower CMI for SF (average CMI for SF=0.66±0.0009 and category=0.69±0.0007, SF<category with p-value≈0), indicating that neurons carry more independent SF-related information than category-related information.
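
The leave-one-neuron-out logic behind the SNC can be sketched as follows. To keep the example dependency-free, a nearest-centroid classifier scored on its own training data stands in for the cross-validated LDA used in the study; that substitution, the function names, and the toy data are all assumptions of this illustration.

```python
def nearest_centroid_accuracy(X, y):
    """Training-set accuracy of a nearest-centroid classifier (stand-in for LDA)."""
    classes = sorted(set(y))
    n_feat = len(X[0])
    centroids = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        centroids[c] = [sum(r[j] for r in rows) / len(rows) for j in range(n_feat)]
    correct = 0
    for x, label in zip(X, y):
        # Predict the class whose centroid is closest in squared Euclidean distance.
        pred = min(classes,
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))
        correct += pred == label
    return correct / len(X)


def single_neuron_contribution(X, y):
    """Accuracy drop when each neuron (column of X) is removed in turn."""
    full = nearest_centroid_accuracy(X, y)
    snc = []
    for j in range(len(X[0])):
        reduced = [row[:j] + row[j + 1:] for row in X]
        snc.append(full - nearest_centroid_accuracy(reduced, y))
    return snc


# Toy data: neuron 0 separates the two labels, neuron 1 is constant.
X = [[0.0, 1.0], [0.1, 1.0], [1.0, 1.0], [0.9, 1.0]]
y = [0, 0, 1, 1]
snc = single_neuron_contribution(X, y)
print(snc)
```

Removing the informative neuron drops accuracy to chance, while removing the constant neuron changes nothing; a population in which many neurons have a non-negligible drop is one where the code is less redundant.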

Sparse SF coding compared to category coding.

a,b Sparse mechanism for SF coding. a The contribution of each neuron in SF and category (face vs. non-face) decoding is evaluated by removing it from the feature set fed to the LDA within the time window of 70ms to 170ms after stimulus onset. The histogram of the SNC value (see Materials and methods) is presented, indicating the amount of accuracy loss when a neuron is removed. The bar plot displays the average SNC values for SF and category, with error bars representing the SEM. The SNC value for SF is significantly higher than for the category. b Furthermore, the CMI of each neuron pair, conditioned to the label (category or SF), is illustrated. CMI reflects the information redundancy between neuron pairs during SF or category decoding. A lower CMI value for SF indicates that individual neurons carry more independent SF-related information compared to category information. c Sparse neuron contribution in SF coding at the early phase of the response. To investigate the contribution of the neurons in population decoding, the sparseness of the LDA weights assigned to each neuron is calculated. Higher sparseness indicates a greater contribution of a smaller group of neurons to the decoding process. The time course of weight sparseness is depicted for SF and category (face vs. non-face) decoding, with shadows representing the STD. During the early phase of the response, the sparseness of SF-related weights is higher than that of the category, while this relationship is reversed during the late phase of the response.

To investigate each neuron’s contribution to the decoding procedure (LDA decision), we computed the sparseness of the LDA weights corresponding to each neuron (see Materials and methods). For SF, we trained the LDA on R1, R2, R4, and R5 with two labels (one for R1 and R2 and the other for R4 and R5). A second LDA was trained to discriminate between faces and non-faces. Subsequently, we calculated the sparseness of the weights associated with each neuron in SF and category decoding. Figure5c illustrates the time course of the weight sparseness for SF and category. The category reflects a bimodal curve with the first peak at 110ms and the second at 210ms after stimulus onset. The second peak is significantly larger than the first one (category first peak, 0.016±0.007, second peak, 0.051±0.013, and p-value<0.001). In SF decoding, neurons’ weights exhibit a trimodal curve with peaks at 100ms, 215ms, and 320ms after the stimulus onset. The first peak is significantly higher than the other two (SF first peak, 0.038±0.005, second peak, 0.018±0.003, third peak, 0.028±0.003, first peak > second peak with p-value<0.001, and first peak > third peak with p-value=0.014). Comparing SF and category, during the early phase of the response (70ms to 170ms), SF sparseness is higher, while in 170ms to 270ms, the sparseness value of the category is higher (p-value < 0.001 for both time intervals). This suggests that, initially, most neurons contribute to category representation, but later, the majority of neurons are involved in SF coding. These findings support distinct mechanisms governing SF and category coding in the IT cortex.
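
One common way to quantify the sparseness of a weight vector is sketched below. The index actually used is defined in Materials and methods; the form chosen here, S = (1 - (mean|w|)^2 / mean(w^2)) / (1 - 1/N), is an assumption for illustration and is not claimed to reproduce the reported values. It equals 0 when all neurons carry equal weight and 1 when a single neuron carries all of it.

```python
def weight_sparseness(weights):
    """Sparseness of a weight vector in [0, 1] (assumed index, see lead-in)."""
    n = len(weights)
    if n < 2:
        raise ValueError("need at least two weights")
    mean_abs = sum(abs(w) for w in weights) / n
    mean_sq = sum(w * w for w in weights) / n
    if mean_sq == 0:
        raise ValueError("all weights are zero; sparseness is undefined")
    return (1.0 - mean_abs ** 2 / mean_sq) / (1.0 - 1.0 / n)


dense = weight_sparseness([1.0, -1.0, 1.0, -1.0])   # equal magnitudes
one_hot = weight_sparseness([1.0, 0.0, 0.0, 0.0])   # a single contributing neuron
mixed = weight_sparseness([3.0, 1.0, 1.0, 1.0])
print(dense, one_hot, mixed)
```

Evaluating such an index on the classifier weights in each time window gives a time course of the kind plotted in Figure5c, where higher values mean that fewer neurons dominate the decision.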

SF representation in the artificial neural networks

We conducted a thorough analysis to compare our findings with CNNs. To assess the SF coding capabilities of CNNs, we utilized popular architectures, including ResNet18, ResNet34, VGG11, VGG16, InceptionV3, EfficientNetb0, CORNet-S, CORNet-RT, and CORNet-Z, with both ImageNet pre-trained and randomly initialized weights (see Materials and methods). Employing feature maps from the four last layers of each CNN, we trained an LDA model to classify the SF content of input images. The results indicated that CNNs exhibit SF coding capabilities with much higher accuracies than the IT cortex. Figure6a shows the SF decoding accuracy of the CNNs on our dataset (SF decoding accuracy with random (R) and pre-trained (P) weights, ResNet18: P=0.96±0.01 / R=0.94±0.01, ResNet34: P=0.95±0.01 / R=0.86±0.01, VGG11: P=0.94±0.01 / R=0.93±0.01, VGG16: P=0.92±0.02 / R=0.90±0.02, InceptionV3: P=0.89±0.01 / R=0.67±0.03, EfficientNetb0: P=0.94±0.01 / R=0.30±0.01, CORNet-S: P=0.77±0.02 / R=0.36±0.02, CORNet-RT: P=0.31±0.02 / R=0.33±0.02, and CORNet-Z: P=0.94±0.01 / R=0.97±0.01). Except for CORNet-Z, object recognition training increases the network’s capacity for SF coding, with an improvement as large as 64% in EfficientNetb0. Furthermore, except for the CORNet family, LSF content exhibits higher recall values than HSF content, as observed in the IT cortex (p-value with random (R) and pre-trained (P) weights, ResNet18: P=0.39 / R=0.06, ResNet34: P=0.01 / R=0.01, VGG11: P=0.13 / R=0.07, VGG16: P=0.03 / R=0.05, InceptionV3: P<0.001 / R=0.05, EfficientNetb0: P=0.07 / R=0.01). The recall values of CORNet-Z and ResNet18 are illustrated in Figure6b. However, while the CNNs exhibited some similarities in SF representation with the IT cortex, they did not replicate the SF-based profiles that predict neuron category selectivity. 
As depicted in Figure6c, although neurons formed similar profiles, these profiles were not associated with the category decoding performances of the neurons sharing the same profile.

SF representation in CNNs.

a SF coding capabilities. We assessed the SF coding capabilities of popular CNN architectures (ResNet18, ResNet34, VGG11, VGG16, InceptionV3, EfficientNetb0, CORNet-S, CORNet-RT, and CORNet-Z) using both randomly initialized (R) and ImageNet pre-trained (P) weights. An LDA model was trained using feature maps from the four last layers of each CNN to classify the SF content of input images. The SF decoding accuracy for each CNN on our dataset is presented with error bars indicating the STD. b LSF-preferred recall performance. The recall performance of two sample networks (CORNet-Z and ResNet18) is presented. STD values are illustrated with error bars. The recall values for LSF content were higher than HSF content in most CNNs, resembling the trends observed in the IT cortex. c The profiles (left) and face/non-face SI value (right) of a sample network (ResNet18). Profiles are calculated similarly to the IT cortex. CNNs did not replicate the SF-based profiles observed in the IT cortex.

Discussion

Utilizing neural responses from the IT cortex of passive-viewing monkeys, we conducted a study on SF representation within this pure visual high-level area. Numerous psychophysical studies have underscored the significant impact of SF on object recognition, highlighting the importance of its representation. To the best of our knowledge, this study presents the first attempt to systematically examine the SF representation in a high-level area, i.e., the IT cortex, using extracellular recording. Understanding SF representation is crucial, as it can elucidate the object recognition procedure in the IT cortex.

Our findings demonstrate explicit SF coding at both the single-neuron and population levels, with LSF being decoded faster and more accurately than HSF. During the early phase of the response, we observe a preference for LSF, which shifts toward a preference for HSF during the late phase. Next, we made profiles based on SF-only (phase-scrambled stimuli) responses for each neuron to predict its category selectivity. Our results show a direct relationship between the population’s category coding capability and the SF coding capability of individual neurons. Although SF and category coding are related, we found that their representations are uncorrelated. Unlike category coding, SF relies more on sparse, individual neuron representations. Finally, when comparing the responses of IT with those of CNNs, it is evident that while SF coding exists in CNNs, the SF profile observed in the IT cortex is notably absent. Our results are based on grouping the neurons of the two monkeys; however, the results remain consistent when looking at the data from individual monkeys.

The influence of SF on object recognition has been extensively investigated through psychophysical studies (Joubert et al., 2007; Schyns and Oliva, 1994; Craddock et al., 2013; Caplette et al., 2014; Cheung and Bar, 2014; Ashtiani et al., 2017). One frequently explored theory is the coarse-to-fine nature of SF processing in object recognition (Schyns and Oliva, 1994; Rotshtein et al., 2010; Gao, 2011; Yardley et al., 2012; Kauffmann et al., 2015; Rokszin, 2016). This aligns with our observation that the decoding onset of LSF is significantly earlier than that of HSF. Different SF bands carry distinct information, progressively conveying coarse-to-fine shape details as we transition from LSF to HSF. Psychophysical studies have indicated the utilization of various SF bands for distinct categorization tasks (Rotshtein et al., 2010). Considering the face as a behaviorally relevant object, psychophysical studies have observed the influence of various SF bands on face recognition. These studies consistently show that enhanced face recognition performance is achieved in the middle and higher SF bands compared to LSF (Costen et al., 1996; Hayes et al., 1986; Fiorentini et al., 1983; Cheung et al., 2008; Awasthi, 2012; Jeantet, 2019). These observations resonate with the identified SF profiles in our study. Neurons that exhibit heightened responses as SF shifts towards HSF demonstrate superior coding of faces compared to other neuronal groups.

Unlike psychophysical studies, imaging studies in this area have been relatively limited. Gaska et al. (1988) observed low-pass tuning curves in the V3A area, and Chen et al. (2018) reported an average low-pass tuning curve in the superior colliculus (SC). Purushothaman et al. (2014) identified two distinct types of neurons in V1 based on their response to SF. The majority of neurons in the first group exhibited a monotonically shifting preference toward HSF over time. In contrast, the second group showed an initial increase in preferred SF followed by a decrease. Our findings align with these observations, showing a rise in preferred SF starting at 170 ms after stimulus onset, followed by a decline at 220 ms after stimulus onset. Additionally, Zhang et al. (2023) found that LSF is the preferred band for over 40% of V4 neurons. This finding is also consistent with our observations, where approximately 40% of neurons consistently exhibited the highest firing rates in response to LSF throughout all response phases. Collectively, these results suggest that the average LSF-preferred tuning curve observed in the IT cortex could be a characteristic inherited from the lower areas in the visual hierarchy. Moreover, examining the coarse-to-fine theory of SF processing, Chen et al. (2018) and Purushothaman et al. (2014) observed a faster response to LSF compared to HSF in SC and V1, which resonates with our coarse-to-fine observation in SF decoding. When analyzing the relationship between the SF content of complex stimuli and IT responses, Bermudez et al. (2009) observed a correlation between neural responses in the IT cortex and the SF content of the stimuli. This finding is in line with our observations, as decoding results directly from the distinct patterns exhibited by various SF bands in neural responses.

Hong et al. (2016) suggested that the neural mechanisms responsible for developing tolerance to identity-preserving transforms also contribute to explicitly representing these category-orthogonal transforms, such as rotation. Extending this perspective to SF, our results similarly suggest an explicit representation of SF within the IT population. However, unlike transforms such as rotation, the neural mechanisms in IT leverage various SF bands for various categorization tasks. Furthermore, our analysis introduced, for the first time, an SF-only profile that predicts category selectivity.

These findings prompt the question of why the IT cortex explicitly represents and codes the SF content of the input stimuli. In our perspective, the explicit representation and coding of SF contents in the IT cortex facilitates object recognition. The population of the neurons in the IT cortex becomes selective for complex object features, combining SFs to transform simple visual features into more complex object representations. However, the specific mechanism underlying this combination is yet to be known. The diverse SF contents present in each image carry valuable information that may contribute to generating expectations in predictive coding during the early phase, thereby facilitating information processing in subsequent phases. This top-down mechanism is suggested by the works of Bar (2003) and Fenske et al. (2006).

Moreover, each object has a unique “characteristic SF signature,” representing its specific arrangement of SFs. “Characteristic SF signatures” refer to the unique patterns or profiles of SFs associated with different objects or categories of objects. When we look at visual stimuli, such as objects or scenes, they contain specific arrangements of different SFs. Imagine a scenario where we have two objects, such as a cat and a car. These objects will have different textures and shapes, which correspond to different distributions of SFs. The cat, for instance, might have a higher concentration of mid-range SFs related to its fur texture, while the car might have more pronounced LSFs that represent its overall shape and structure. The IT cortex encodes these signatures, facilitating robust discrimination and recognition of objects based on their distinctive SF patterns.

The concept of characteristic SF signatures is also related to the “SF tuning” observed in our results. Neurons in the visual cortex, including the IT cortex, have specific tuning preferences for different SFs. Some neurons are more sensitive to HSF, while others respond better to LSF. This distribution of sensitivity allows the visual system to analyze and interpret different information related to different SF components of visual stimuli concurrently. Moreover, the IT cortex’s coding of SF can contribute to object invariance and generalization. By representing objects in terms of their SF content, the IT cortex becomes less sensitive to variations in size, position, or orientation, ensuring consistent recognition across different conditions. SF information also aids the IT cortex in categorizing objects into meaningful groups at various levels of abstraction. Neurons can selectively respond to shared SF characteristics among different object categories (assuming that objects in the same category share a level of SF characteristics), facilitating decision-making about visual stimuli. Overall, we posit that SF’s explicit representation and coding in the IT cortex enhance its proficiency in object recognition. By capturing essential details and characteristics of objects, the IT cortex creates a rich representation of the visual world, enabling us to perceive, recognize, and interact with objects in our environment.

Finally, we compared the SF representation within the IT cortex with that of current state-of-the-art deep neural networks. CNNs stand as one of the most promising models for comprehending visual processing within the primate ventral visual processing stream (Kubilius et al., 2018, 2019). Examining the higher layers of CNN models (most similar to IT), we found that randomly initialized and pre-trained CNNs can code for SF. This is consistent with our previous work on the CIFAR dataset (Toosi et al., 2022). Nevertheless, they do not exhibit the SF profile we observed in the IT cortex. This emphasizes the uniqueness of SF coding in the IT cortex and suggests that artificial neural networks might not fully capture the complete complexity of biological visual processing mechanisms, even when they encompass certain aspects of SF representation. Our results suggest that the IT cortex uses a different mechanism for SF coding compared to contemporary deep neural networks, highlighting the potential for new approaches that consider the role of SF in ventral stream models.

Our results are not affected by several potential confounding factors. First, each stimulus in the set also has a corresponding phase-scrambled variant. These phase-scrambled stimuli maintain the same SF characteristics as their respective face or non-face counterparts but lack shape information. This approach allows us to investigate SF representation in the IT cortex without the confounding influence of shape information. Second, our results, obtained through a passive viewing task, remain unaffected by attention mechanisms. Third, all stimuli (intact, SF filtered, and phase scrambled) are corrected for illumination and contrast so that these category-orthogonal basic characteristics of the stimuli do not contribute to the results (see Materials and methods). Fourth, while our dataset is not balanced in samples per category, this imbalance does not affect our results, because we equalized the number of samples per category when training our classification models by random sampling from the stimulus set (see Materials and methods). One limitation of our study is the relatively low number of objects in the stimulus set. However, the decoding performance of category classification (face vs. non-face) in intact stimuli is 94.2%, which supports the generalizability and validity of our results.

In summary, we studied the SF representation within the IT cortex. Our findings reveal the existence of a sparse mechanism responsible for encoding SF in the IT cortex. Moreover, we studied the relationship between SF representation and object recognition by identifying an SF profile that predicts object recognition performance. These findings establish neural correlates of the psychophysical studies on the role of SF in object recognition and shed light on how IT represents and utilizes SF for the purpose of object recognition.

Materials and methods

Animals and recording

The activity of neurons in the IT cortex of two male macaque monkeys weighing 10 and 11 kg, respectively, was analyzed following the National Institutes of Health Guide for the Care and Use of Laboratory Animals and the Society for Neuroscience Guidelines and Policies. The experimental procedures were approved by the Institute of Fundamental Science committee. Before implanting a recording chamber in a subsequent surgery, magnetic resonance imaging and Computed Tomography (CT) scans were performed to locate the prelunate gyrus and arcuate sulcus. The surgical procedures were carried out under sterile conditions and Isoflurane anaesthesia. Each monkey was fitted with a custom-made stainless-steel chamber, secured to the skull using titanium screws and dental acrylics. A craniotomy was performed within the 30×70mm chamber for both monkeys, with dimensions ranging from 5 mm to 30 mm A/P and 0 mm to 23 mm M/L.

During the experiment, the monkeys were seated in custom-made primate chairs, and their heads were restrained while a tube delivered juice rewards to their mouths. The system was mounted in front of the monkey, and eye movements were captured at 2 kHz using the EyeLink PM-910 Illuminator Module and EyeLink 1000 Plus Camera (SR Research Ltd, Ottawa, CA). Stimulus presentation and juice delivery were controlled using custom software written in MATLAB with the MonkeyLogic toolbox. Visual stimuli were presented on a 24-inch LED-lit monitor (Asus VG248QE, 1920 x 1080, 144 Hz) positioned 65.5 cm away from the monkeys’ eyes. The actual time the stimulus appeared on the monitor was recorded using a photodiode (OSRAM Opto Semiconductors, Sunnyvale, CA).

One electrode was affixed to a recording chamber and positioned within the craniotomy area using the Narishige two-axis platform, allowing for continuous electrode positioning adjustment. To make contact with or slightly penetrate the dura, a 28-gauge guide tube was inserted using a manual oil hydraulic micromanipulator from Narishige, Tokyo, Japan. For recording neural activity extracellularly in both monkeys, varnish-coated tungsten microelectrodes (FHC, Bowdoinham, ME) with a shank diameter of 200–250 µm and impedance of 0.2–1 MΩ (measured at 1 kHz) were inserted into the brain. A pre-amplifier and amplifier (Resana, Tehran, Iran) were employed for single-electrode recordings, with filtering set between 300 Hz and 5 kHz for spikes and 0.1 Hz and 9 kHz for local field potentials. Spike waveforms and continuous data were digitized and stored at a sampling rate of 40 kHz for offline spike sorting and subsequent data analysis. Area IT was identified based on its stereotaxic location, position relative to nearby sulci, patterns of gray and white matter, and response properties of encountered units.

Stimulus and task paradigm

The experimental task comprised two distinct phases: selectivity and main phases, each involving different stimuli. During the selectivity phase, the objective was to identify a responsive neuron for recording purposes. If an appropriate neuron was detected, the main phase was initiated. However, if a responsive neuron was not observed, the recording location was adjusted, and the selectivity phase was repeated. First, we will outline the procedure for stimulus construction, followed by an explanation of the task paradigm.

The stimulus set

The size of each image was 500 x 500 pixels. Images were displayed on a 60 Hz monitor with a resolution of 1920 x 1080 pixels. The monitor’s response time (changing the color of pixels in grey space) was one millisecond. The monkey’s eyes were located at a distance of 65 cm from the monitor. Each stimulus occupied a space of 5 x 5 degrees. All images were displayed in the center of the monitor. During the selectivity phase, a total of 155 images were used as stimuli. Regarding SF, the stimuli were divided into unfiltered and filtered categories. Unfiltered images included 74 separate grayscale images in the categories of animal face, animal body, human face, human body, man-made, and natural. To create the stimulus, these images were placed on a grey background with a value of 0.5. The filtered images included 27 images in the same categories as the previous images, each shown in its intact form and filtered in two frequency ranges: low (1 to 10 cycles per image) and high (18 to 75 cycles per image), totaling 81 images. In the main phase of the test, nine images, including three non-face images and six face images, were considered. These images are shown in Figure 1c. For the main phase, the images were filtered in five frequency ranges. These intervals were 1 to 5, 5 to 10, 10 to 18, 18 to 45, and 45 to 75 cycles per image. For each image in each frequency range, a scrambled version was obtained by scrambling the image phase in the Fourier domain. Therefore, each image in the main phase contained one unfiltered version (intact), five filtered versions (R1 to R5), and six scrambled versions (i.e., 12 versions in total).

SF filtering

Butterworth filters were used to filter the images in this study. A low-pass Butterworth filter is defined as follows.

where Blp is the absolute value of the filter, r is the distance of the pixel from the center of the image, fc is the filter’s cutoff frequency in terms of cycles per image, and n is the order of the filter. Similarly, the high-pass filter is defined as follows.

To create a band-pass filter with a pass frequency of f1 and a cutoff frequency of f2, a multiplication of a high-pass and a low-pass filter was performed (Bbp(r; f1, f2) = Blp(r; f1) × Bhp(r; f2)). To apply the filter, the image underwent a two-dimensional Fourier transform, followed by multiplication with the appropriate filter. Subsequently, the inverse Fourier transform was employed to obtain the filtered image. Afterward, a linear transformation was applied to adjust the brightness and contrast of the images. Brightness was determined by the average pixel value of the image, while contrast was represented by its standard deviation (STD). To achieve specific brightness (L) and contrast (C) levels, the following equation was employed to correct the images, with I the filtered image and I′ the corrected image: I′ = C × (I − µ)/σ + L,

where σ and µ are the STD and mean of the image. In this research, specific values for L and C were chosen as 0.5 (corresponding to 128 on a scale of 255) and 0.0314 (equivalent to 8 on a scale of 255), respectively. Analysis of Variance (ANOVA) indicated no significant difference in brightness and contrast among various groups, with p-values of 0.62 for brightness and 0.25 for contrast. To create scrambled images, the original image underwent Fourier transformation, after which its phase was scrambled. Subsequently, the inverse Fourier transform was applied. Since the resulting signal may not be real, its real part was extracted. The resulting image then underwent processing through the specified filters in the primary phase.
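The filtering, correction, and scrambling steps above can be sketched as follows. This is a minimal illustration rather than the code used in the study: it assumes the standard Butterworth magnitude response 1/sqrt(1 + (r/fc)^(2n)) and its high-pass counterpart, a filter order of n = 2 (the order is not stated here), square images, and scrambling by replacing the Fourier phase with uniformly random phase. The function names and the explicit f_low/f_high arguments are hypothetical.

```python
import numpy as np

def radial_frequency(size):
    """Distance of each Fourier coefficient from DC, in cycles per image."""
    f = np.fft.fftfreq(size) * size  # integer cycles per image along one axis
    fx, fy = np.meshgrid(f, f)
    return np.hypot(fx, fy)

def butter_lowpass(r, fc, n=2):
    # Assumed form: standard Butterworth magnitude response.
    return 1.0 / np.sqrt(1.0 + (r / fc) ** (2 * n))

def butter_highpass(r, fc, n=2):
    # Assumed high-pass counterpart; the small floor avoids division by zero at DC.
    return 1.0 / np.sqrt(1.0 + (fc / np.maximum(r, 1e-12)) ** (2 * n))

def bandpass_filter(image, f_low, f_high, n=2):
    """Keep roughly f_low..f_high cycles per image (square image assumed)."""
    r = radial_frequency(image.shape[0])
    mask = butter_highpass(r, f_low, n) * butter_lowpass(r, f_high, n)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * mask))

def correct_brightness_contrast(image, L=0.5, C=0.0314):
    """Linear map so that the mean becomes L and the STD becomes C."""
    return C * (image - image.mean()) / image.std() + L

def phase_scramble(image, rng):
    """Keep the amplitude spectrum, randomize the phase, take the real part."""
    spectrum = np.fft.fft2(image)
    random_phase = rng.uniform(-np.pi, np.pi, size=image.shape)
    return np.real(np.fft.ifft2(np.abs(spectrum) * np.exp(1j * random_phase)))

rng = np.random.default_rng(0)
image = rng.random((64, 64))                       # stand-in for a stimulus
r2 = correct_brightness_contrast(bandpass_filter(image, 5, 10))
r2_scrambled = correct_brightness_contrast(
    bandpass_filter(phase_scramble(image, rng), 5, 10))
```

In this sketch the scrambled image is filtered after scrambling, matching the order described above, and both outputs share the same mean and STD by construction.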

Task paradigm

The task was divided into two distinct phases: the selectivity phase and the main phase. Each phase comprised multiple blocks, each containing two components: the presentation of a fixation point and a stimulus. The monkey was required to maintain fixation within a window of ±1.5 degrees around the center of the monitor throughout the entire task. During the selectivity phase, there were five blocks, and stimuli were presented randomly within each block. The duration of stimulus presentation was 50ms, while the fixation point was presented for 500ms. The selectivity phase consisted of a total of 775 trials. A neuron was considered responsive if it exhibited a significant increase in response during the time window of 70ms to 170ms after stimulus onset, compared to a baseline window of −50ms to 50ms. This significance was determined using the Wilcoxon signed-rank test with a significance level of 0.05. Once a neuron was identified as responsive, the main phase began. In the main phase, there were 15 blocks. The main phase involved a combination of the six most responsive stimuli, selected from the selectivity phase, along with nine fixed stimuli. In each block, all stimuli were presented once in random order. The stimulus duration in the main phase was 35ms, and the fixation point was presented for 465ms. For the purpose of analysis, our focus was primarily on the main phase of the task.

Neural representation

All analyses were conducted using custom code developed in MATLAB (MathWorks). In total, 266 neurons (157 M1 and 109 M2) were considered for the analysis. Neurons were sorted using the ROSS toolbox (Toosi et al., 2021). Each stimulus in each time step was represented by a vector of N elements, where the i-th element was the average response of the i-th neuron for that stimulus in a time window of 50 ms around the given time step. We used both single-level and population-level analysis. Numerous studies have examined the benefits of population representation (Averbeck et al., 2006; Adibi et al., 2014; Abbott and Dayan, 1999; Dehaqani et al., 2018). These studies have demonstrated that enhancing signal correlation within the neural data population leads to improved decoding performance for object discrimination. To maintain consistency across trials, responses were normalized using the z-score procedure. All time courses were based on a 50 ms sliding window with a 5 ms time step. We utilized a time window from 70 ms to 170 ms after stimulus onset for our analysis (except for temporal analysis). This time window was selected because the average firing rate across neurons was significantly higher than the baseline window of −50 ms to 50 ms (Wilcoxon signed-rank test, p-value < 0.05).
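As an illustration of the sliding-window representation, the following minimal sketch computes a firing-rate time course for one neuron and one trial. It assumes spike times are given in milliseconds relative to stimulus onset and that each window is centered on its time step; the function name and the default time range are hypothetical.

```python
def sliding_rates(spike_times_ms, start=-50, stop=500, window=50, step=5):
    """Firing rate (spikes/s) in a window centered on each time step."""
    centers = list(range(start, stop + 1, step))
    half = window / 2
    rates = []
    for c in centers:
        # Count spikes falling in [c - half, c + half).
        count = sum(1 for t in spike_times_ms if c - half <= t < c + half)
        rates.append(count * 1000.0 / window)
    return centers, rates

centers, rates = sliding_rates([0, 10, 20], start=-50, stop=100)
```

Stacking such values across neurons, after z-scoring, would give the N-element vector described above for each time step.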

Statistical analysis

All statistical analyses were conducted as outlined in this section unless otherwise specified. In the single-level analysis, where each run involves a single neuron, paired comparisons were performed using the Wilcoxon signed-rank test, and unpaired comparisons utilized the Wilcoxon rank-sum test, both at a significance level of 0.05. The results and their standard error of the mean (SEM) were reported. For population analysis, we used an empirical method, and the results were reported with their STD. To compare two paired sets of X and Y (Y could represent the chance level), we calculated the statistic r as the number of pairs where x − y < 0. The p-value was computed as r/M, where M is the total number of runs. When r = 0, we report p-value < 1/M.
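The empirical test for population analyses can be sketched as below. This is a minimal illustration assuming the statistic counts runs in which x − y is negative; the function names are hypothetical.

```python
def empirical_p_value(x, y):
    """x, y: paired sequences with one value per run. Returns (p, r, M)."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired")
    m = len(x)
    # r counts the runs in which x falls below y.
    r = sum(1 for xi, yi in zip(x, y) if xi - yi < 0)
    return r / m, r, m

def report_p(x, y):
    """Format the result, using the p < 1/M notation when r = 0."""
    p, r, m = empirical_p_value(x, y)
    return f"p < 1/{m}" if r == 0 else f"p = {p:.3g}"

chance = [0.5, 0.5, 0.5, 0.5]
print(report_p([0.9, 0.8, 0.7, 0.4], chance))
print(report_p([0.9, 0.8, 0.7, 0.6], chance))
```

With 1000 runs, the smallest p-value that can be reported this way is p < 1/1000.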

Classification

All classifications were carried out employing the LDA method, at both the population and single levels. As described before, each stimulus in each block was represented by an N-element vector to be fed to the classifier. For face (non-face) vs. scrambled classification, only the face (non-face) and scrambled intact stimuli were used. For face vs. non-face (category) classification, only unscrambled intact stimuli were utilized. Finally, only the scrambled stimuli were fed to the classifier for the SF classification, and the labels were R1 to R5. In population-level analysis, averages and standard deviations were computed using a leave-p-out method, where 30% of the samples were kept as test samples in each run. All analyses were based on 1000 leave-p-out runs. To determine the onset time, one STD was added to the average accuracy value in the interval of 50 ms before to 50 ms after stimulus onset. Then, the onset time was identified as the point where the accuracy was significantly greater than this value for five consecutive time windows.
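The onset-time rule can be sketched as follows. This is a simplified illustration: it assumes the STD is taken across the baseline time points of the accuracy time course, replaces the significance test with a plain threshold comparison, and only searches after the baseline interval. The function name is hypothetical.

```python
from statistics import mean, pstdev

def onset_time(times, accuracy, baseline=(-50, 50), n_consecutive=5):
    """First time at which accuracy stays above baseline mean + 1 STD
    for n_consecutive windows; None if that never happens."""
    base = [a for t, a in zip(times, accuracy) if baseline[0] <= t <= baseline[1]]
    threshold = mean(base) + pstdev(base)
    run, start = 0, None
    for t, a in zip(times, accuracy):
        if t <= baseline[1]:
            continue  # assumption: search only after the baseline interval
        if a > threshold:
            run += 1
            if run == 1:
                start = t
            if run == n_consecutive:
                return start
        else:
            run = 0
    return None

times = list(range(-50, 200, 5))                  # 5 ms time steps
accuracy = [0.2 if t < 80 else 0.6 for t in times]
print(onset_time(times, accuracy))
```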

Preferred SF

Preferred SF for a given neuron was calculated as follows,

where PSF is the preferred SF, fRi is the average firing rate of the neuron for Ri, and cRi is −2 for R1, −1 for R2, …, 2 for R5. When PSF > 0, the neuron exhibits higher firing rates for higher SF ranges on average and vice versa. To identify the number of neurons responding to a specific SF range higher than others, we performed an ANOVA analysis with a significance level of 0.05. Then, we picked the SF range with the highest firing rate for that neuron.
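A minimal sketch of a preferred-SF index consistent with the definitions above is given below. The normalized weighted-average form (the sum of cRi × fRi divided by the sum of fRi) is an assumption made for illustration, as is the function name; only the coefficients −2 to 2 and the sign interpretation come from the text.

```python
def preferred_sf(firing_rates):
    """firing_rates: average firing rate for R1..R5 (non-negative)."""
    coeffs = [-2, -1, 0, 1, 2]          # cRi for R1..R5
    total = sum(firing_rates)
    if total == 0:
        return 0.0                      # silent neuron: no preference
    # Assumed form: firing-rate-weighted average of the coefficients.
    return sum(c * f for c, f in zip(coeffs, firing_rates)) / total

print(preferred_sf([10, 8, 6, 4, 2]))   # negative: LSF-leaning
print(preferred_sf([2, 4, 6, 8, 10]))   # positive: HSF-leaning
```

Under this assumed form the index is bounded between −2 and 2 and equals 0 for a flat response.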

SF profile

To form the SF profiles, a quadratic curve was fitted to the neuron response from R1 to R5, using exclusively scrambled stimuli. Each trial was treated as an individual sample. Neurons were categorized into three groups based on the extremum point of the fitted curve: i) the extremum is lower than R2, ii) between R2 and R4, and iii) greater than R4. Within the first group, if the neuron’s response in R1 and R2 significantly exceeded (or fell below) R4 and R5, the SF profile was classified as LSF preferred (or HSF preferred). The same procedure was applied to the third group. In the second group, if the neuron response in R2 was significantly higher (lower) than the response of R1 and R5, the neuron profile was identified as U (IU). Neurons not meeting any of these criteria were grouped under the flat category.
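The first step of the profiling, grouping a neuron by the extremum of the fitted quadratic, can be sketched as below. This illustration covers only the curve fit and the three-way grouping; the subsequent significance comparisons that assign the final profile label are not included. It assumes R1 to R5 are coded as 1 to 5 on the x-axis, and the function name is hypothetical.

```python
import numpy as np

def extremum_group(responses_by_range):
    """responses_by_range: five sequences of single-trial responses (R1..R5).
    Returns (extremum position on the 1..5 axis, group label)."""
    x = np.concatenate([np.full(len(r), i + 1.0)
                        for i, r in enumerate(responses_by_range)])
    y = np.concatenate([np.asarray(r, dtype=float) for r in responses_by_range])
    a, b, _ = np.polyfit(x, y, 2)       # y ~ a*x^2 + b*x + c, one sample per trial
    if a == 0:
        return None, "none"             # purely linear fit has no extremum
    x0 = -b / (2 * a)                   # vertex of the parabola
    if x0 < 2:
        return x0, "i"                  # extremum below R2
    if x0 <= 4:
        return x0, "ii"                 # extremum between R2 and R4
    return x0, "iii"                    # extremum above R4

# Two identical trials per range, peaked response centered on R3.
peaked = [[10 - (x - 3) ** 2] * 2 for x in range(1, 6)]
position, group = extremum_group(peaked)
```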

To establish sub-populations of SF/category-sorted neurons, we initially sorted the neurons according to their accuracy to decode the SF/category. Subsequently, a sliding window of size 20 was employed to select adjacent neurons in the SF or category-sorted list. Consequently, the first sub-population comprised the initial 20 neurons exhibiting the lowest individual accuracy in decoding the SF/category. In comparison, the last sub-population encompassed the top 20 neurons with the highest accuracy in decoding SF/category.
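The construction of sorted sub-populations can be sketched as follows. The sketch assumes the window advances by one neuron at a time, which is not stated above; the function name is hypothetical.

```python
def sorted_subpopulations(accuracies, window=20):
    """accuracies: per-neuron decoding accuracy (SF or category).
    Returns lists of neuron indices, from least to most accurate."""
    order = sorted(range(len(accuracies)), key=lambda i: accuracies[i])
    # Assumed step of one neuron between consecutive sub-populations.
    return [order[s:s + window] for s in range(len(order) - window + 1)]

subs = sorted_subpopulations([0.3, 0.1, 0.2, 0.5, 0.4], window=2)
print(subs)
```

With 266 neurons and a window of 20, this assumed step yields 247 overlapping sub-populations.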

SNC and CMI

To examine the influence of individual neurons on population-level decoding, we introduced the concept of the SNC. It measures the reduction in decoding performance when a single neuron is removed from the population. We systematically removed each neuron from the population one at a time and measured the corresponding drop in accuracy compared to the case where all neurons were present.
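The SNC computation can be sketched as a leave-one-neuron-out loop. In this minimal illustration the decoder is passed in as a callable, `decode_accuracy`, which is a hypothetical stand-in for the LDA decoding procedure and takes the list of neuron indices to include.

```python
def single_neuron_contribution(n_neurons, decode_accuracy):
    """Drop in accuracy when each neuron is removed from the full population."""
    everyone = list(range(n_neurons))
    full = decode_accuracy(everyone)
    snc = []
    for i in everyone:
        without_i = [j for j in everyone if j != i]
        snc.append(full - decode_accuracy(without_i))
    return snc

# Toy decoder for illustration only: accuracy grows with fixed per-neuron weights.
weights = [0.0, 1.0, 2.0]
toy_decoder = lambda idx: 0.5 + 0.1 * sum(weights[i] for i in idx)
snc = single_neuron_contribution(3, toy_decoder)
```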

To quantify the CMI between pairs of neurons, we discretized their response patterns using ten levels of uniformly spaced bins. The CMI is calculated using the following formula.

where ni and nj represent the discretized responses of the two neurons, and C represents the conditioned variable, which can be the category (face/non-face) or the SF range (LSF (R1 and R2) and HSF (R4 and R5)). We normalized the CMI by subtracting the CMI obtained from randomly shuffled responses and added the average CMI of SF and category. CMI calculation enables us to assess the degree of information shared or exchanged between pairs of neurons, conditioned on the category or SF while accounting for the underlying probability distributions.
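A minimal sketch of the CMI estimate is given below, assuming the standard definition I(ni; nj | C) = Σ p(ni, nj, c) log2[p(c) p(ni, nj, c) / (p(ni, c) p(nj, c))] estimated from counts, with ten uniformly spaced bins per neuron. The shuffle-based normalization described above is not included, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def discretize(values, n_bins=10):
    """Map responses to uniformly spaced bins between their min and max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def conditional_mutual_information(ni, nj, c, n_bins=10):
    """Plug-in estimate of I(ni; nj | c) in bits; c holds condition labels."""
    a, b = discretize(ni, n_bins), discretize(nj, n_bins)
    n = len(c)
    n_abc = Counter(zip(a, b, c))
    n_ac, n_bc, n_c = Counter(zip(a, c)), Counter(zip(b, c)), Counter(c)
    cmi = 0.0
    for (ai, bi, ci), k in n_abc.items():
        # The ratio of counts equals p(c) p(a,b,c) / (p(a,c) p(b,c)).
        cmi += (k / n) * log2(k * n_c[ci] / (n_ac[(ai, ci)] * n_bc[(bi, ci)]))
    return cmi

labels = [0, 0, 0, 0, 1, 1, 1, 1]
same = conditional_mutual_information([0, 1, 0, 1] * 2, [0, 1, 0, 1] * 2, labels)
indep = conditional_mutual_information([0, 0, 1, 1] * 2, [0, 1, 0, 1] * 2, labels)
```

In this toy example two identical binary responses share one bit given the condition, while two independent responses share none.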

Sparseness analysis

The sparseness analysis was conducted on the LDA weights, regarded as a measure of task relevance. To calculate the sparseness of the LDA weights, the neuron responses were first normalized using the z-score method. Then, the sparseness of the weights associated with the neurons in the LDA classifier was computed. The sparseness is computed using the following formula.

where w is the neuron weight in LDA and E(w²) represents the mean of the squared weights of the neurons. The maximum sparseness occurs when only one neuron is active, whereas the minimum sparseness occurs when all neurons are equally active.
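As an illustration, the sketch below uses a commonly used sparseness index, S = (1 − E(|w|)²/E(w²)) / (1 − 1/N), which is assumed here because it matches the stated limits (1 when a single neuron carries all the weight, 0 when all weights have equal magnitude). The exact form and the function name are assumptions.

```python
def sparseness(weights):
    """Sparseness of LDA weights: 0 (all equal magnitude) to 1 (one non-zero)."""
    n = len(weights)
    mean_sq = sum(w * w for w in weights) / n        # E(w^2)
    if n < 2 or mean_sq == 0:
        raise ValueError("need at least two weights and one non-zero weight")
    mean_abs = sum(abs(w) for w in weights) / n      # E(|w|)
    return (1 - mean_abs ** 2 / mean_sq) / (1 - 1 / n)

print(sparseness([0, 0, 0, 3]))      # single active neuron
print(sparseness([1, -1, 1, -1]))    # all neurons equally weighted
```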

Deep neural network analysis

To compare our findings with those derived from deep neural networks, we selected a diverse set of CNN architectures: ResNet18, ResNet34, VGG11, VGG16, InceptionV3, EfficientNetb0, CORNet-S, CORNet-RT, and CORNet-z, chosen to offer a broad overview of SF processing capabilities within deep learning models. We used both randomly initialized weights and weights pre-trained on the ImageNet dataset, which allowed us to assess the influence of the prior knowledge embedded in pre-trained weights on SF decoding. To extract feature maps, we fed our stimulus set to the models and captured the feature maps from the last four layers, excluding the classifier layer. Our reported results are based primarily on the final layer (preceding classification), but they were consistent across all examined layers. For classification and SF profiling, we followed the same procedures as in our neural response analysis.